(57) ABSTRACT

Voice control of a service application provided to a terminal from a remote server is distributed between the terminal and a remote application part. A relatively low power automatic speech recognition system (ASR) is provided in the terminal for recognizing those portions of user-supplied audio input that relate to terminal functions or functions defined by a predefined markup language. Recognized words may be used to control the terminal functions, or may alternatively be converted to text and forwarded to the remote server. Unrecognized portions of the audio input may be encoded and forwarded to the remote application part, which includes a more powerful ASR. The remote application part may use its ASR to recognize words defined by the application. Recognized words may be converted to text and supplied as input to the remote server. In the reverse direction, text received by the remote application part from the remote server may be converted to an encoded audio output signal, and forwarded to the terminal, which can then generate a signal to be supplied to a loudspeaker. In this way, a voice control mechanism is used in place of the remote server's visual display output and keyboard input.

34 Claims, 5 Drawing Sheets 
VOICE CONTROL OF A USER INTERFACE TO SERVICE APPLICATIONS

BACKGROUND

The present invention relates generally to control of service applications, more particularly to voice control of service applications, and still more particularly to voice control of service applications from a remote terminal.

The most common type of terminal for Internet access is a conventional personal computer (PC) terminal with a large, high resolution display and a relatively high data transmission bandwidth. When a user wishes to use an Internet connection to control a service application located at a remote location, he or she typically uses the keyboard associated with the PC terminal, and types in commands. The data is communicated via the Internet to the service application, which can then respond accordingly. The user's PC terminal display permits response information to be displayed in the form of text and/or graphics which can easily be viewed by the user.

The recent standardization of a Wireless Application Protocol (WAP), using the Wireless Markup Language (WML), has enabled terminals with small displays, limited processing power and a low data transmission bandwidth (e.g., digital cellular phones and terminals) to access and control services and content in a service network such as the Internet. WAP is a layered communication protocol that includes network layers (e.g., transport and session layers) as well as an application environment including a microbrowser, scripting, telephony value-added services and content formats. The simple syntax and limited vocabulary in WML make WAP suitable for controlling the service and for interacting with the content from a client terminal with low processing and display capabilities.

While the ability to use these smaller terminals is a major convenience to the user (who can more readily carry these along on various journeys), reading selection menus and other large amounts of text (e.g., e-mail and help text) from a small display and typing in responses on a small keyboard with multi-function keys has some disadvantages. These disadvantages may be largely overcome by the substitution of a voice-controlled interface to the service application. A voice-controlled interface is also useful for providing "hands-free" operation of a service application, such as would be required when the user is driving a car.

Automatic speech recognition systems (ASR) are known. An ASR for supporting a voice-controlled application may be a user shared resource in a central server or a resource in the client terminal. The simpler ASR recognizes isolated words with a pause in-between words, whereas the advanced ASR is capable of recognizing connected words. The complexity of the ASR increases with the size of the vocabulary that has to be recognized in any particular instance of the dialog with the application.

If the ASR is implemented at a central server, it must be capable of recognizing many users with different languages, dialects and accents. Conventional speaker-independent speech recognition systems normally use single word ASR with a very limited vocabulary (e.g., "yes", "no", "one", "two", etc.) to reduce the amount of required processing and to keep the failure rate low. Another alternative for improving the accuracy of recognition is to make the speech recognition adaptive to the user by training the recognizer on each user's individual voice, and asking the user to repeat or spell a misunderstood word. In a multi-user environment, each user's profile must be stored.

An implementation of the speech recognizer in the terminal will only have to recognize one user (or a very small number of users), so adaptive training may be used. The required processing for a combined word ASR may still be too large to be implemented in the terminal. For example, the processing power of today's mobile terminals (such as those employed in cellular telephone systems, personal digital assistants, and special purpose wireless terminals) is capable of implementing an isolated word ASR with a small vocabulary (e.g., for dialing and for accessing a personal telephone book stored in the terminal). Training may be required for adding new words to the vocabulary.

A problem that exists in present day centralized server ASRs is that a voice channel (voice call) has to be established between the terminal and a gateway or server that performs voice recognition. However, a voice channel may introduce distortion, echoes and noise that will degrade the recognition performance.

A centralized ASR is also an expensive and limited network resource that will require a high processing capability, a large database and an adaptive training capability for the individual voices and dialects in order to bring down the failure rate in the recognition process. Because it is a limited resource, the central server or gateway may need to implement a dial up voice channel access capability.

The new generation of WAP-supported mobile terminals will be able to control and interact with a large variety of services and content. However, the terminal display and keyboard typically have very limited input/output (I/O) capability, thereby making a voice-controlled interface desirable. As explained above, today's low-cost terminals can support some ASR capability, but this is inadequate to support voice access to a multi-user Application Server that will require a large vocabulary or a time-consuming training of the recognizer for each application.

SUMMARY

It is therefore an object of the present invention to provide methods and apparatuses for enabling relatively low power terminals to access and control remote server applications via a voice controlled interface.

The foregoing and other objects are achieved in methods and apparatuses for controlling a service application provided to a terminal from a remote server. In accordance with one aspect of the invention, this is achieved by receiving an audio input signal representing audio information, and using a first automatic speech recognition system located in the terminal to determine whether the audio input signal includes one or more words defined by a first vocabulary, wherein portions of the audio input signal not corresponding to the one or more words defined by the first vocabulary constitute an unrecognized portion of the audio input signal. If the audio input signal includes one or more words defined by the first vocabulary, then a terminal application part of an application protocol service logic is used to determine what to do with the one or more words defined by the first vocabulary. The unrecognized portion of the audio input signal is formatted for inclusion in a data unit whose structure is defined by a first predefined markup language. The data unit is communicated to a remote application part via a first digital data link that operates in accordance with a first application protocol. In the remote application part, the formatted unrecognized portion of the audio input signal is extracted from the data unit. A remote application part service logic is then used to determine what to do with the formatted unrecognized portion of the audio input signal.
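
The division of labor summarized above, in which the terminal acts on words from its first vocabulary and forwards everything else in a data unit to the remote application part, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions and not the implementation disclosed here: the names `DataUnit`, `recognize_locally`, `terminal_process` and `remote_extract` are hypothetical, the exact-match lookup merely stands in for a real recognizer, and the `audio/x-encoded` content type and base64 payload are placeholders for whatever the predefined markup language would actually define.

```python
import base64
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class DataUnit:
    """Hypothetical stand-in for a markup-language data unit."""
    content_type: str  # placeholder label, not a registered type
    payload: str       # encoded audio carried as base64 text


def recognize_locally(word_audio: bytes,
                      first_vocabulary: Dict[bytes, str]) -> Optional[str]:
    """Toy terminal ASR: an exact lookup instead of feature matching."""
    return first_vocabulary.get(word_audio)


def terminal_process(isolated_words: List[bytes],
                     first_vocabulary: Dict[bytes, str]
                     ) -> Tuple[List[str], List[DataUnit]]:
    """Split the input into locally recognized words and data units
    carrying the unrecognized portions for the remote application part."""
    recognized: List[str] = []
    outgoing: List[DataUnit] = []
    for audio in isolated_words:
        text = recognize_locally(audio, first_vocabulary)
        if text is not None:
            # Word belongs to the first vocabulary: handled in the terminal.
            recognized.append(text)
        else:
            # Unrecognized portion: format it for inclusion in a data unit.
            payload = base64.b64encode(audio).decode("ascii")
            outgoing.append(DataUnit("audio/x-encoded", payload))
    return recognized, outgoing


def remote_extract(unit: DataUnit) -> bytes:
    """Remote application part side: recover the encoded audio."""
    return base64.b64decode(unit.payload)
```

In this sketch the remote side receives exactly the encoded audio that the terminal could not match, which it could then hand to its own, more capable recognizer.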
In another aspect of the invention, the audio input signal is in the form of compressed digitally encoded speech.

In yet another aspect of the invention, if the audio input signal includes one or more words defined by the first vocabulary, then the terminal application part of the application protocol service logic causes the one or more words to be used to select one or more terminal functions to be performed.

In still another aspect of the invention, the one or more terminal functions include selecting a current menu item as a response to be supplied to the remote server.

In yet another aspect of the invention, a current menu item is associated with a first selection; and the one or more terminal functions include associating the current menu item with a second selection that is not the same as the first selection.

In still another aspect of the invention, if the audio input signal includes one or more words defined by the first vocabulary, then the terminal application part of the application protocol service logic causes a corresponding message to be generated and communicated to the remote application part via the first digital data link. In some embodiments, the corresponding message may include state information, text or binary data.

In yet another aspect of the invention, the remote application part forwards the corresponding message to the remote server.

In still another aspect of the invention, the remote application part forwards the corresponding message to the remote server via a second digital data link that operates in accordance with a second application protocol. The first [remainder of sentence illegible in source]

In yet another aspect of the invention, a second automatic speech recognition system located in the remote application part is used to determine whether the unrecognized portion of the audio input signal includes one or more words defined by a second vocabulary. If the unrecognized portion of the audio input signal includes one or more words defined by the second vocabulary, then the remote application part service logic is used to determine what to do with the one or more words defined by the second vocabulary.

In still another aspect of the invention, the first vocabulary exclusively includes words defined by a syntax of the first predefined markup language; and the second vocabulary exclusively includes words associated with the remote server.

In yet another aspect of the invention, if the unrecognized portion of the audio input signal includes one or more words defined by the second vocabulary, then the remote application part service logic causes a corresponding keyboard emulation response to be generated and sent to the remote server.

In another aspect of the invention, if the unrecognized portion of the audio input signal includes one or more words defined by the second vocabulary, then the remote application part service logic causes a remote application part service logic state to be changed.

In still another aspect of the invention, the remote application part receives text from the remote server, and generates a corresponding audio output signal representing audio information. The audio output signal is formatted for inclusion in a second data unit whose structure is defined by the first predefined markup language. The second data unit is communicated to the terminal via the first digital data link. In the terminal, the audio output signal is extracted from the second data unit and a loudspeaker signal is generated therefrom.

BRIEF DESCRIPTION OF THE DRAWINGS

The objects and advantages of the invention will be understood by reading the following detailed description in conjunction with the drawings in which:

FIGS. 1a and 1b are block diagrams of alternative embodiments of a distributed VCSA in accordance with one aspect of the invention;

FIG. 2 is a block diagram of an exemplary embodiment of a voice controlled remote server in accordance with the invention;

FIG. 3 is a block diagram of an exemplary embodiment of a voice controlled remote server in accordance with the invention;

FIG. 4 is a flowchart of operations performed by the terminal application part in accordance with an exemplary embodiment of the invention; and

FIG. 5 is a flowchart depicting an exemplary embodiment of operations of the RAP in accordance with an exemplary embodiment of the invention.

DETAILED DESCRIPTION

The various features of the invention will now be described with respect to the figures, in which like parts are identified with the same reference characters. The following description utilizes the WAP and WML standards as a basis for linking a relatively low-power terminal to a remote application. However, it will be recognized that these standards are used by way of example only, and that the inventive concepts utilized here are equally applicable in other environments that do not operate in accordance with these particular standards.

In one aspect of the invention, components of a Voice Controlled Service Application (VCSA) are distributed between parts implemented in the terminal, and remaining parts that are implemented in remote equipment. FIGS. 1a and 1b are block diagrams of alternative embodiments of a distributed VCSA in accordance with this aspect of the invention. In FIG. 1a, a client part 101 is coupled to a server part 103 via a first digital link 105. The client part 101 is implemented in the terminal, while the server part 103 is implemented in a separate processor that is most likely at a remote location. The processor in which the server part 103 runs is, in most embodiments, more powerful (e.g., faster, more storage space, etc.) than the terminal in which the client part 101 runs. The first digital link 105 for coupling the client and server parts 101, 103 may be wireless or wireline. The data that is communicated over the first digital link 105 is preferably in the form of cards and scripts/libraries created by a standardized markup language, such as WML. In alternative embodiments, different markup languages can be used instead. In each case, however, the markup language should be one that is supportable by the terminal's relatively low processing power and limited input/output resources. The WML is preferred for use in wireless mobile terminals because its cards and scripts/libraries, which can be downloaded through WAP URL services, can be used to create applications that enhance and extend services available in today's advanced mobile networks.

The client part 101 includes a simple ASR, such as one that is capable of recognizing a small number (e.g., up to
about 50) isolated words. A more powerful ASR, such as one capable of recognizing a large vocabulary of words supplied in continuous speech, is implemented in the server part 103. In operation, the client part 101 receives speech from the user. The ASR in the client part 101 attempts to isolate and recognize words. Those that are correctly recognized are acted upon. Many of the recognized words typically would be used to control local functions in the terminal, such as scrolling through a menu, selecting a menu item, and accessing various terminal resources such as a locally stored phone book. Other words may be recognized as operands (e.g., data) that should be supplied to the server. For these words, corresponding text is retrieved from a memory in the terminal. This text is then sent to the server part 103 via the first digital link 105. The text is formatted in such a way that the server part 103 will recognize it as data input, and will treat it accordingly.

Those words that are not recognized by the client part 101 are formatted (e.g., as Multipurpose Internet Mail Extension (MIME) types) and sent to the server part 103. The server part 103 ascertains that this is unrecognized speech, and uses its own, more powerful ASR to analyze the received speech. After the analysis, the server part 103 may act accordingly. For example, the recognized speech may consist of commands for controlling the server application, in which case the commands are acted upon. The recognized speech may also represent data input for the server application, in which case it is treated as such. In the event that the ASR is unable to recognize the supplied speech, it may take actions such as sending encoded speech back to the client part 101, which in turn plays the encoded speech to the user. The encoded speech in this case may be a voiced request for the user to repeat or spell out the unrecognized speech.

In the alternative embodiment depicted in FIG. 1b, the server part 103 has been replaced by a gateway/proxy part 107 that is coupled to a server 109 by means of a second digital link 111. The second digital link 111 for coupling the gateway/proxy part 107 and the server 109 may be wireless or wireline. The data that is communicated over the second digital link 111 is preferably in the form of cards and scripts/libraries created by a standardized markup language that may, but need not be, the same as the data format that is used on the first digital link 105. When the data formats are different, one function of the gateway/proxy part 107 is to convert the data from one format to the other. Conversion in this case may include not only substituting keywords from one format to another (e.g., from Hyper-Text Markup Language (HTML) to WML), but also some level of filtering to weed out data that cannot be received by the terminal. For example, if the server 109 is an application that is accessible via the Internet, then it may send HTML web pages that include graphical information that cannot be displayed on the relatively low power terminal. In this case, the gateway/proxy part 107 needs to eliminate such data, and pass on to the client 101 only that data that is appropriate.

In many embodiments, it is anticipated that the data formats used on the first and second data links 105, 111 will be the same, such as both being WML formats. In such cases, the conversion performed by the gateway/proxy part 107 may likely include the substitution of voice data for text, and vice versa. That is, the server 109 may provide data in the form of large text menus that are intended to be displayed on a PC screen. However, as explained above, the relatively low power terminal may not be capable of displaying large menus, and/or such large menus may be difficult for the user to read on a small terminal screen. Thus, in one aspect of the invention, the gateway/proxy part 107 converts the received text into audio that may be supplied to the client part 101 (as MIME formatted data) and played for the user. In this way, the user can hear the possible selections, rather than having to view them on the screen. The user may make selections by speaking them, rather than typing them. As explained above, the spoken text will either be recognized and converted to text by the ASR in the client part 101, or alternatively by the ASR in the gateway/proxy part 107. In either case, this text is then passed by the gateway/proxy part 107 to the server 109. In this way, the server 109 need not be specially configured to deal with a vocal interface. In fact, in this configuration the existence of the vocal interface is completely transparent to the server 109, which is aware only of the text that it sends and receives.

The invention will now be described in greater detail with reference to an exemplary embodiment depicted in FIGS. 2 and 3. The architecture of this exemplary embodiment is essentially the same as that depicted in FIGS. 1a and 1b. However, in this embodiment the total system is logically divided into four parts: a Terminal Part (TP) 203; a Terminal Application Part (TAP) 201; a Remote Application Part (RAP) 205; and an External Services and Content (ESC) 207 part. The TP 203 and TAP 201 embody the client part 101 of the VCSA, and the RAP 205 embodies either the server part 103 or the gateway/proxy part 107 of the VCSA. The ESC 207 corresponds to the server 109. These components will now be described in greater detail. It will be understood that the various components described below are individually either well-known (e.g., various storage elements, microphone, loudspeaker), or easily implementable based on the high-level description provided, and therefore need not be described at a high level of detail. Various embodiments may utilize one or more programmable elements executing a stored program to perform a number of functions (e.g., feature matching, maintenance of protocol stacks, etc.). In alternative embodiments, these may instead be hardwired logic gates. Whether a particular implementation approach is better than another will depend on the particular application being considered, and is therefore beyond the scope of this disclosure.

The TP 203 is implemented in the terminal, and supports the WAP standard (or alternatively another application protocol). A TAP interface 209 is provided for allowing interaction with the TAP 201, which supports voice interaction and control of a WAP application. The TP 203 further includes a WAP client protocol stack 213 for enabling communication in accordance with the WAP standard protocol via a first data link 211, which may alternatively be a wireless or wireline digital channel.

A microphone 215 is provided in the TP 203 for receiving speech from a user of the terminal. The output of the microphone 215 is supplied to a TP audio encoder (e.g., a GSM voice encoder) which encodes the audio input signal into a compressed data format. The encoded audio data is supplied to the TAP interface 209. When audio is to be presented to the user, it is supplied, through the TAP interface 209, in the compressed data format (e.g., the GSM voice encoder format) to a TP audio decoder 219, the audio output of which is supplied to a speaker 221.

The TAP 201 is also implemented in the terminal for the purpose of supporting the basic voice interaction with the terminal functions, such as call handling, address book management, and the like. The TAP 201 also supports the voice interaction and control of the WAP application. The TAP 201 includes a TP interface 223 that enables it to communicate with the TP 203.

The TAP 201 functions as a voice-oriented browser in the terminal. The functioning of this browser will now be
described with reference to the flowchart of FIG. 4. Audio input is received from the microphone 215 and supplied to the TP audio encoder 217 (step 401). The output from the TP audio encoder 217 is supplied, through the TAP interface 209 and the TP interface 223, to a start/stop detector and recording unit 225 implemented in the TAP 201 (step 403). The TAP 201 uses the start/stop detector and recording unit 225 to detect the start and stop of a supplied voice input signal, and to use this to limit the extension of the audio input to an audio time interval referred to herein as an "isolated word". The start/stop detector and recording unit 225 includes a cache memory (not shown) for storing (i.e., recording) the TP audio encoded data for this isolated word.

The isolated word is supplied from the start/stop detector and recording unit 225 to an ASR unit 227 for isolated word recognition analysis (step 405). The ASR 227 in this exemplary embodiment includes a feature vector extraction unit 229 which receives the isolated word and maps it into a vector space that is suitable for use by a feature matching and decision unit 231. A reference vocabulary, including the limited standard WAP vocabulary in the WML syntax and a predefined vocabulary that is terminal dependent, is stored in a TAP reference data base 233. The predefined, terminal-dependent vocabulary is used for extending the WML standard vocabulary to include words that will make the application dialog more user friendly or to control terminal functions that are not in the VCSA. The isolated words are preferably stored in three formats: the text format, the corresponding TP audio encoded data, and the associated feature vector representing the isolated word. Feature vectors from the TAP reference data base 233 are supplied to a second input of the feature matching and decision unit 231. The feature matching and decision unit 231 compares the feature vector supplied at the output of the feature vector extraction unit 229 with feature vectors supplied by the TAP reference data base 233, and determines whether a match is found. Outputs 237, 239 from the feature matching and decision unit 231 are supplied to the TAP control logic 235 and indicate whether a match was found (decision block 407).

Isolated words may be of several types: those which are associated with terminal control functions (e.g., scrolling a menu up or down); those that determine a response that is to be forwarded to the RAP 205 (and perhaps ultimately to the server), such as a "SELECT" command for selecting one item from a menu (equivalent to "clicking" on a menu item using a PC mouse); and those that are entirely defined by the particular server application. Thus, if an isolated word is recognized in the terminal ("Yes" output from decision block 407), a determination is then made to determine what type of isolated word it is (decision block 409). When a terminal control word is recognized, the TAP control logic 235 causes the terminal function to be performed (step 411). In some cases, this may include generating audio output to indicate to the user a change in a current terminal state, such as which item in a menu is currently being selected.

If the recognized word is service related, an appropriate response is generated as a message and transferred to the RAP (step 413), via the WAP client protocol stack 213. The message might include any combination of state information, text, binary data and other information necessary to allow the RAP 205 to generate an appropriate response that is to be forwarded to the ESC 207. The response generated by the RAP 205 will preferably emulate a keyboard input selection that would be generated by a normal text-based WAP terminal. Although this keyboard response could be generated by the TAP 201 and merely passed to the RAP 205 for forwarding to the ESC 207, it is preferred, for efficiency reasons, to merely send the necessary state (and/or other) information to the RAP 205 and to allow it to generate its response to the ESC 207 in the form of the necessary keyboard emulation response, including but not limited to text, binary data, state information or menu selection code(s).

Returning now to decision block 407, if the isolated word was not recognized by the ASR 227, the TAP control logic 235 in conjunction with the TAP WAP service logic 245 makes a decision whether to inform the user (decision block 415). This decision may be based, for example, on a current [illegible] 235. For example, if the TAP control logic 235 is expecting a terminal control or menu selection function to be received, then the user might be informed (step 417) that the isolated word was not recognized, and be asked to repeat the isolated word, to enter it by spelling, or to use a keyboard selection. Alternatively, if the TAP control logic 235, in conjunction with TAP WAP service logic 245, is expecting unrecognizable audio input to be provided, such as for use as content for Email that is being generated, then the unrecognized isolated word might simply be passed on to the RAP 205 (step 419). Since the RAP's ASR 307 is preferably more powerful than the TAP's ASR 227, the unrecognized isolated word might also be passed on to the RAP 205 if the TAP 201 wants assistance with the task of recognizing the unrecognized isolated word. This aspect of the invention is described in greater detail below.

In order to pass the unrecognized isolated word on to the RAP 205, the audio encoded data from the start/stop detector and recording unit 225 is formatted as MIME types by a MIME formatting unit 247. Communication of the MIME-formatted audio encoded data is made through the TP interface 223, the TAP interface 209, and the WAP client protocol stack to a communication RAP interface 243 that is coupled to the first data link 211. The TAP 201 is a client to RAP service logic 321 located in the RAP 205, and may be implemented in a small WAP terminal device with low processing capacity (including mobile as well as stationary devices). The RAP service logic 321 may also be a client to the services and content in the ESC 207.

As mentioned above, voice output to the user is generated by the TP audio decoder 219 having an output coupled to a speaker 221. The TP audio decoder 219 accepts data in the TP audio encoded format from the TAP reference database 233 or from the RAP 205. TP audio encoded format data supplied by the RAP 205 may be received imbedded as MIME types in the WAP protocol. This technique has the advantage of eliminating the need for any text-to-speech conversion module in the terminal. Additional words, stored as TP audio encoded data in the TAP reference database 233, may be used to supplement the dialog in order to make it more user friendly.

Turning now to the RAP server 205 (and to FIG. 3, which depicts the RAP 205 in greater detail), it may be implemented as a multi-user, central WAP application server, as a WAP gateway/proxy or as a single-user local server, dedicated to the TAP user (e.g., the user's PC, palmtop device, and the like). It is anticipated that the RAP 205 will usually have a more powerful processing capacity for automatic speech recognition as well as a RAP reference database for the extended vocabulary that is required for the particular service application.

As shown in FIGS. 2 and 3, the RAP 205 may also be implemented as a WAP gateway/proxy, connected to an ESC
207 in a different location. For example, the ESC 207 may
be one or more application servers that provide information
and content over the Internet.

As indicated earlier, the RAP 205 is coupled to the first
data link 211, and therefore has a first communication
interface 301 coupled to the first data link 211 for this purpose.
The first communication interface 301 is also coupled to a
WAP server protocol stack 303, which ensures that communications
proceed in accordance with the WAP (or alternatively
other chosen) communication protocol. The RAP 205
also includes RAP control logic 305 for controlling the
operation of other RAP resources. Among these is an ASR
307 that will recognize the TP audio encoded words that
were not recognized in the TAP 201, that is, words that were
transferred to the RAP 205 as MIME types in the WAP
protocol. To perform speech recognition, the RAP's exemplary
ASR 307 includes a feature vector extraction unit 309,
a feature matching and decision unit 311 and a RAP reference
database 313. In operation, TP audio encoded data is
supplied to the feature vector extraction unit. The corresponding
feature vector is then supplied to the feature
matching and decision unit 311. The RAP reference database
313 stores the feature vectors, corresponding text and corresponding
TP audio encoded data of all words to be
recognized. Feature vectors from the RAP reference database
313 are supplied to another input of the feature matching
and decision unit 311. The feature matching and decision
unit 311 compares the feature vector supplied by the feature
vector extraction unit 309 with feature vectors supplied by
the RAP reference database 313, and indicates on output
lines 315, 317 whether the input word(s) was/were recognized.
The ASR 307 may succeed at speech recognition
where the TAP's ASR 227 failed because the RAP's ASR
307 is preferably more powerful, and includes a larger
database of reference words.

In addition to being able to recognize isolated words, the
RAP's ASR 307 may also have the capability of recognizing
continuous speech. This capability may be useful in a
number of instances, including a situation in which a user of
the terminal is supposed to say single word commands, but
instead says a phrase. For example, it might be expected that
the user will say something like "CALL [PAUSE] JOHN
[PAUSE]", but instead says "CALL JOHN", without any
pauses between the two words. In this case, the phrase
"CALL JOHN" could be mistaken as an isolated word by the
Start/Stop detector and recording unit 225, and recorded as
such. If the TAP's ASR 227 cannot recognize this audio
input, the TAP 201 can convert it into MIME-formatted
audio encoded data and send it to the RAP 205 along with
an indication that the TAP 201 was in a state in which it was
expecting a command input. The RAP 205 in this case can
respond by applying the unrecognized "isolated word" (in
this example, the phrase "CALL JOHN") to its more powerful
ASR 307. The RAP's ASR 307 need not be capable of
recognizing all possible words that can be spoken by the
user. Instead, it may be provided with a list of recognizable
TP commands, and perform a so-called "wildcard" recognition
operation, in which only the TP command words are
looked for. Thus, if the ASR 307 looks for, among other
things, the phrase "*CALL*" (where the "*" indicates a
"don't care" word that may precede or follow the word
"CALL"), then the ASR 307 will detect that the unrecognized
"isolated word" consists of the word "CALL" with
another unrecognized part following it. This information can
then be sent back to the TAP 203. In response, the TAP 203
can invoke the terminal's CALL command, and ask the user
to repeat the name of the person who is to be called. Thus,
in this aspect of the invention, the ASR function is actually
distributed between portions performed in the terminal and
portions that are performed remotely, at the RAP 205.

FIG. 5 is a flowchart depicting an exemplary embodiment
of the overall operation of the RAP 205. If input is received
from the TP 203 ("Yes" path out of decision block 501), it
is examined to determine what it represents (decision block
503). If it is state information associated with a TP response,
the RAP 205 uses it to update its own state (e.g., the
state of the RAP service logic 321) and act accordingly. This
may include generating a keyboard emulation response to be
forwarded to the ESC 207 (step 505). As mentioned earlier,
the keyboard emulation response may include, but is not
limited to, text, binary data, state information or a menu
selection code.

If the input received from the TP 203 is not state
information, then it is a MIME-formatted unrecognized
isolated word. This is then handled in accordance with the
particular application (step 507). For example, the unrecognized
isolated word may be applied to the RAP's ASR 307
which may, for example, generate corresponding text to be
forwarded to the ESC 207. The corresponding text, in this
case, would be supplied by the RAP reference database 313.

The unrecognized text might alternatively represent audio
content that is to be attached to, for example, Email that is
forwarded to the WAP application in the ESC 207. In
another alternative, the unrecognized text may constitute a
control word that may be acted upon by the RAP itself
without requiring communication with the ESC 207. For
example, the unrecognized text may be a request for another
part of a menu that could not be [...] RAP 205 has stored
the complete menu, then [...]

An alternative to subjecting the received TP audio
encoded data to automated speech recognition is to convert
it into a different audio format, such as a wave formatted file,
which can be attached to, for example, an E-mail response.
This conversion is performed by an audio format converter
323. The audio format converter 323 is preferably bidirectional
in order to be capable of converting a voice mail
format (received from the ESC 207) into TP audio encoded
data that may be routed to the TP 203 for the purpose of
being played to the user.

If input is not received from the TP ("No" path out of
decision block 501), then it must be determined whether text
has been received from the ESC 207 (decision block 509).
If so ("Yes" path out of decision block 509), then it is
preferably supplied to a text-to-TP audio encoder 319 which
generates therefrom corresponding TP audio encoded data
(step 511). This data may then be formatted into MIME
types and transferred to the TP 203 in the WAP protocol
(step 513). As explained earlier, the received TP audio
encoded data may then be played for the user over the
speaker 221. This conversion of text to audio is required
when, for example, the application is reading text from the
ESC 207 to the user, or when the RAP 205 is reading a stored
help text to the user. When the RAP 205 may be a shared
resource by clients that use a variety of different encoders,
the text-to-TP audio encoder 319 may be designed to support
any and all of the necessary audio encoded formats that one
of the client terminals may utilize.

In some embodiments, it may be possible to eliminate the
audio format converter 323, and to instead look up the text
in the RAP reference database 313, and output the corresponding
TP audio encoded data. The reason why it is
preferred to use a separate audio format converter 323,
however, is to be able to generally support services that use
a large vocabulary, such as "read my mail" or other services
that present text files, such as Help files, to the user. In these
instances, it is not desired to store an entire dictionary in the
encoded data in the RAP 205.

The RAP 205 further includes a proxy client to next level
of services and content unit 325 for supporting access to
other external service and content providers.

Looking now at the ESC 207, it may be an application
with or without any support for WAP applications, but may
in either case be used as an information or content provider
to the service application in the RAP 205.

The invention takes advantage of the standardized WML
vocabulary and syntax in WAP to enable a WAP-terminal
(i.e., a terminal with an implemented WAP client) to have a
voice-controlled interface to all services designed for WAP-terminals.
The service logic for the VCSA is divided
between the TAP 201 and the RAP 205 in the application. All
local interactions between the TAP 201 and the TP 203 are
handled by TAP WAP service logic 245 in order to minimize
transmissions between the TAP 201 and the RAP 205. The
TAP WAP service logic 245 issues orders that are carried out
by the TAP control logic 235 for controlling the data and
information flow within the TAP 201. In another optional
aspect of the invention, the TAP control logic 235 may also
have the capability of inserting supporting text and words to
enhance and improve the dialog with the user compared with
the very limited vocabulary in the WML syntax. Such
additional text might, for example, be in the form of audio
that explains to the user in greater detail what steps must be
performed to make a particular menu selection. This additional
vocabulary may be stored in the TAP reference
database 233 as TP audio encoded data strings.
Alternatively, the additional vocabulary may be requested
from the RAP reference database 313, and transferred, as TP
encoded audio data, over the first data link 211 (WAP
channel) to the TP 203, which can play the audio for the user
over the speaker 221.

In accordance with another aspect of the invention, it is
possible to update, enhance or replace the vocabulary in the
TAP reference database 233 with a complete set of text,
encoded TP audio data and feature vectors supplied through
the RAP 205. The newly downloaded information may
represent changes in the WML, or even a new language.

The TAP WAP service logic 245 may be a client to RAP
service logic 321 located in the RAP 205. The TAP WAP
service logic 245 only controls the TP and TAP functions,
and executes the basic WML syntax. It does not support the
application-dependent part of the VCSA. The TAP WAP
service logic 245 and RAP service logic 321 are synchronized
during a service application. The RAP service logic
321 and the vocabulary for supporting a new VCSA may be
downloaded to the RAP 205 from an external service
provider.

In an exemplary embodiment, in order to activate the
VCSA, the user may speak a predefined voice command,
such as the word "services". In response, the TP 203 may, for
example, convert this speech into TP audio encoded data,
and supply it to the TAP 201 for recognition. Assuming that
the user's command is recognized by the TAP ASR 227, TP
encoded audio that is provided by the TAP reference database
233 is converted into an audio signal by the TP audio
decoder 219 and supplied to the speaker 221. The TAP WAP
service logic 245 is responsible for assembling the words
into a text string, and the TAP control logic 235 executes the
appropriate audio output instructions. This audio may
prompt the user to make a selection between several alternatives
from the service menu stored in the TAP 201. The
WAP connection to the RAP 205 will be set up when a
particular WAP service application has been selected. The
service logic in the TAP 201 and RAP 205 will then start
executing the service.

An exemplary service will shortly be described for the
purpose of illustration. In order to facilitate a better understanding
of the WML portion of the example, a quick
reference guide to WML 1.0 will first be presented. In this
brief summary, only the WML syntax is illustrated. Values,
ranges and defaults of attributes are not shown. This information
is well-known, however, and need not be presented
here.

The following prolog must appear at the top of every
WML deck (i.e., a .wml file):
<?xml version="1.0"?>
<!DOCTYPE WML PUBLIC "-//WAPFORUM//DTD WML 1.0//EN"
  "http://www.wapforum.org/DTD/wml.xml">
<!-- This is a comment. -->

Every deck has exactly one <WML> element:
<WML xml:language="">
  <HEAD> . . . </HEAD>
  <TEMPLATE> . . . </TEMPLATE>
  <CARD> . . . </CARD>
</WML>

Each deck may optionally have exactly one <HEAD> element:
<HEAD>
  <ACCESS DOMAIN="" PATH="" PUBLIC=""/>
  <META NAME="" HTTP-EQUIV="" USER-AGENT=""
    CONTENT="" SCHEME=""/>
</HEAD>

Each deck may optionally have exactly one <TEMPLATE>
element:
<TEMPLATE ONENTERFORWARD="" ONENTERBACKWARD=""
  ONTIMER="">
  <DO> . . . </DO>
  <ONEVENT> . . . </ONEVENT>
</TEMPLATE>

Each deck has at least one <CARD> element:
<CARD NAME="" TITLE="" NEWCONTEXT="" STYLE=""
  ONENTERFORWARD="" ONENTERBACKWARD="" ONTIMER="">
  <DO> . . . </DO>
  <ONEVENT> . . . </ONEVENT>
  <TIMER . . . />
  <INPUT . . . />
  <SELECT> . . . </SELECT>
  <FIELDSET> . . . </FIELDSET>
  Cards can contain text flow with markup (such as <B>
  bold </B>) including images <IMG> and anchors <A>.
</CARD>

Navigations are indicated by the <DO> element:
<DO TYPE="" LABEL="" NAME="" OPTIONAL="">
  <GO> . . . </GO>
  <PREV> . . . </PREV>
  <REFRESH> . . . </REFRESH>
  <NOOP . . . />
</DO>
Events are handled by the <ONEVENT> or the <TIMER> element:
<ONEVENT TYPE="">
  <GO> . . . </GO>
  <PREV> . . . </PREV>
  <REFRESH> . . . </REFRESH>
  <NOOP . . . />
</ONEVENT>

<TIMER KEY="" DEFAULT=""/>

Specific actions are one of <GO>, <PREV>,
<REFRESH>, or <NOOP> elements:
<GO URL="" SENDREFERER="" METHOD="" ACCEPT-CHARSET=""
  POSTDATA="">
  <VAR NAME="" VALUE=""/>
</GO>
<PREV>
  <VAR NAME="" VALUE=""/>
</PREV>
<REFRESH>
  <VAR NAME="" VALUE=""/>
</REFRESH>
<NOOP/>

Hints how to group input fields can be provided with the
<FIELDSET> element:
<FIELDSET TITLE="">
  <INPUT . . . />
  <SELECT> . . . </SELECT>
  <FIELDSET> . . . </FIELDSET>
</FIELDSET>

Input is obtained by one of <INPUT> or <SELECT>
elements:

<INPUT KEY="" TYPE="" VALUE="" DEFAULT=""
  FORMAT="" EMPTYOK="" SIZE=""
  MAXLENGTH="" TABINDEX="" TITLE=""/>

<SELECT TITLE="" KEY="" DEFAULT="" IKEY=""
  IDEFAULT="" MULTIPLE="" TABINDEX="">
  <OPTGROUP> . . . </OPTGROUP>
  <OPTION> . . . </OPTION>
</SELECT>

Selection list elements can be grouped using an
<OPTGROUP> element:
<OPTGROUP TITLE="">
  <OPTION> . . . </OPTION>
  <OPTGROUP> . . . </OPTGROUP>
</OPTGROUP>

Selection list elements are specified using an <OPTION>
element:

<OPTION VALUE="" TITLE="" ONCLICK="">
  Options have text flow with markup, but no images or
  anchors.
  <ONEVENT> . . . </ONEVENT>
</OPTION>

Text flow with markup includes the following elements:
<B> . . . </B> bold
<I> . . . </I> italic
<U> . . . </U> underlined
<BIG> . . . </BIG> increased font size
<SMALL> . . . </SMALL> decreased font size
<EM> . . . </EM> emphasized
<STRONG> . . . </STRONG> strongly emphasized
<BR ALIGN="" MODE=""/> force line breaks
<TAB ALIGN=""/> align subsequent text in columns
<A TITLE=""> an anchor tag embedded in text flow.
  <GO> . . . </GO>
  <PREV> . . . </PREV>
  <REFRESH> . . . </REFRESH>
  Anchors have text flow with markup, but no images or
  anchors.
</A>

Images are indicated with the <IMG> element:
<IMG ALT="" SRC="" LOCALSRC="" VSPACE=""
  HSPACE="" ALIGN="" HEIGHT="" WIDTH=""/>
The exemplary WAP service will now be described. 
Suppose a weather information service is available for a
WAP-enabled terminal having display/keypad interaction.
The service might first present the user with a list of options 
that, on the screen, look like: 
Show weather for: 
>Stockholm 
Helsinki 
Zurich 
Other 

By pressing the UP or DOWN key, the user can move the
cursor (i.e., the ">" character) up or down the list. By
pressing an ACCEPT key (on some mobile telephones, such 
as those manufactured and sold by Ericsson, this is the YES 
key), the user causes a short code for the selected city to be 
sent to the weather service provider. 

If "Other" is selected, then an input field is provided to the
user: 

Enter city name: 

The user then activates appropriate device keys to enter
the city name, followed by an ENTER key.

The WML for this service looks like the following: 
<WML>
  <TEMPLATE>
    <!-- This will be executed for each card in this deck -->
    <DO TYPE="ACCEPT">
      <GO URL="http://weather.com/by-city?$(city)"/>
    </DO>
  </TEMPLATE>

  <CARD NAME="cities">
    Show weather for:
    <SELECT TITLE="city" IKEY="N" IDEFAULT="1"
      KEY="city">
      <OPTION VALUE="STHM">Stockholm</OPTION>
      <OPTION VALUE="HSKI">Helsinki</OPTION>
      <OPTION VALUE="ZURH">Zurich</OPTION>
      <OPTION ONCLICK="#other">Other</OPTION>
    </SELECT>
  </CARD>

  <CARD NAME="other">
    Enter city name:
    <INPUT VALUE="city" TYPE="TEXT"/>
  </CARD>
</WML>
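
The $(city) reference in the GO element's URL stands for the value of the
WML variable "city", which the selection sets. A minimal Python sketch of
that substitution might look as follows. The function name substitute is
hypothetical, and the URL escaping that a real WML client would apply is
omitted here for brevity.

```python
import re

def substitute(url_template, variables):
    """Replace each $(name) reference with its value from `variables`.

    Unknown names are replaced with an empty string (an assumption made
    for this sketch).
    """
    return re.sub(
        r"\$\((\w+)\)",
        lambda match: variables.get(match.group(1), ""),
        url_template,
    )

# Selecting "Stockholm" sets the variable "city" to the short code "STHM".
url = substitute("http://weather.com/by-city?$(city)", {"city": "STHM"})
print(url)
```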

Using the above-described inventive techniques in a 
speech recognition enabled terminal, the user would hear: 

"Show weather for these city options" 

Note that it combined "Show weather for", the TITLE
attribute of the SELECT tag, "city", and some glue text from
the TAP reference database 233, "these" and "options". This
can be device implementation dependent or predefined as 
supplementary words to the WML vocabulary in relation to 
the syntax. 
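
The way such a prompt could be assembled is sketched below in Python. This
is only an illustration under stated assumptions: the function name
build_prompt is hypothetical, the card is a cleaned-up, well-formed
rendering of the "cities" card, and the glue words are passed in as plain
strings rather than fetched from the TAP reference database 233.

```python
import xml.etree.ElementTree as ET

# Cleaned-up rendering of the "cities" card from the example deck.
CARD = """
<CARD NAME="cities">
Show weather for:
<SELECT TITLE="city" IKEY="N" IDEFAULT="1" KEY="city">
<OPTION VALUE="STHM">Stockholm</OPTION>
<OPTION VALUE="HSKI">Helsinki</OPTION>
<OPTION VALUE="ZURH">Zurich</OPTION>
<OPTION ONCLICK="#other">Other</OPTION>
</SELECT>
</CARD>
"""

def build_prompt(card_xml, glue=("these", "options")):
    """Combine the card text, the SELECT title and glue words."""
    card = ET.fromstring(card_xml.strip())
    select = card.find("SELECT")
    # Text that precedes the SELECT element, without the trailing colon.
    lead = (card.text or "").strip().rstrip(":")
    title = select.get("TITLE", "")
    prompt = f"{lead} {glue[0]} {title} {glue[1]}"
    # The option labels are what the device would speak one by one.
    choices = [(opt.text or "").strip() for opt in select.findall("OPTION")]
    return prompt, choices

prompt, choices = build_prompt(CARD)
print(prompt)
print(choices)
```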

The user then hears the device state the names in the list 
with a short pause between each. 
"Stockholm" [PAUSE]
"Helsinki" [PAUSE] 

The purpose of the pause is to give enough time for the
user to respond with something, such as:
"ACCEPT", meaning select this one, or
"NO", meaning next, or

"BACKOUT", meaning go back to the previous screen
full, etc.

If the user responds with "ACCEPT" to the "Other" 
option, then the device says: 

"Enter city name, end with O.K., or pause for 2 seconds." 

Note how the device combined the given text, and the 
instructions to end the input. 

The user then speaks the city name, ending with "O.K." 
Now the device sends the spoken input to the remote 
application for speech recognition and further processing.
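
The spoken menu dialog described above can be sketched as a simple loop.
The Python below is a hedged illustration, not the described embodiment:
the function name run_menu is hypothetical, recognized replies are modeled
as plain strings, and treating silence during a pause as "NO" is an
assumption of this sketch.

```python
def run_menu(choices, replies):
    """Walk a spoken menu; `replies` yields one recognized word per item.

    Returns (selected item or None, list of items spoken so far).
    """
    replies = iter(replies)
    spoken = []
    for item in choices:
        spoken.append(item)           # the device would speak the item here
        reply = next(replies, "NO")   # assumption: silence counts as "NO"
        if reply == "ACCEPT":
            return item, spoken       # select the item just spoken
        if reply == "BACKOUT":
            return None, spoken       # go back without a selection
    return None, spoken               # list exhausted without a selection

selection, spoken = run_menu(
    ["Stockholm", "Helsinki", "Zurich", "Other"],
    ["NO", "NO", "NO", "ACCEPT"],
)
print(selection)
```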

The various aspects of the invention enable a terminal 
device, with relatively low processing capability and I/O 
devices that may be cumbersome (e.g., very small) or 
relatively unavailable (e.g., while driving), to use an inter-
active voice interface to access service applications that are
developed for general use by terminals that do not have these 
limitations. Complexity of the ASR requirements in the 
terminal is reduced by separating the speech recognition 
system for the VCSA into a small terminal speech recognizer 
for the standard markup language (e.g., WML) syntax and a 
more powerful speech recognizer for the application depen- 
dent part of the VCSA in a remote device having more 
processing power. As a consequence of this arrangement, no 
modifications of the service content are necessary. 
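
The division of recognition work can be illustrated with a short Python
sketch. Everything named here is an assumption made for illustration: the
two vocabularies are invented, the function name route is hypothetical,
and utterances are modeled as text strings standing in for encoded audio.

```python
# Hypothetical vocabularies; a real terminal would hold the markup
# language command words and the remote part the application's words.
TERMINAL_VOCABULARY = {"ACCEPT", "NO", "BACKOUT", "O.K."}
REMOTE_VOCABULARY = {"STOCKHOLM", "HELSINKI", "ZURICH", "GENEVA"}

def route(utterance):
    """Return (where the utterance was handled, recognized word or None)."""
    word = utterance.upper()
    if word in TERMINAL_VOCABULARY:
        # Recognized locally by the small terminal recognizer.
        return ("terminal", word)
    # Otherwise the unrecognized audio is forwarded to the remote part,
    # whose more powerful recognizer holds the application vocabulary.
    if word in REMOTE_VOCABULARY:
        return ("remote", word)
    return ("remote", None)

print(route("accept"))
print(route("Geneva"))
```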

Another advantage of the invention derives from the fact 
that it is unnecessary to establish a voice channel between 
the terminal and the remote application server. This is 
because the audio response to the application is encoded into 
predefined digital types, such as MIME types, which are 
transmitted over a digital data channel. 
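
As an illustration of carrying encoded audio as a MIME type over a data
channel, the following Python sketch wraps and unwraps a byte string with
the standard email package. The helper names wrap_audio and unwrap_audio
and the "audio/basic" subtype are assumptions; the source does not say
which MIME subtype or transfer encoding is used.

```python
from email import message_from_bytes
from email.mime.audio import MIMEAudio

def wrap_audio(encoded_audio: bytes, subtype: str = "basic") -> bytes:
    """Wrap already-encoded audio bytes in a MIME audio part."""
    part = MIMEAudio(encoded_audio, _subtype=subtype)  # base64 by default
    return part.as_bytes()

def unwrap_audio(data: bytes):
    """Return (content type, original audio bytes) from a MIME part."""
    message = message_from_bytes(data)
    return message.get_content_type(), message.get_payload(decode=True)

payload = bytes(range(16))            # stand-in for encoded speech
content_type, audio = unwrap_audio(wrap_audio(payload))
print(content_type, audio == payload)
```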

An additional benefit of the invention is that it provides a 
general way to standardize and limit the voice dialog 
vocabulary for any voice controlled service by using a 
standardized markup language like WML. This simplifies 
the task of voice recognition, and reduces the errors that 
would otherwise occur from the presence of different pro-
nunciations of words in a multi-user application.

The invention can also provide a way to determine the end 
of the user's spoken response to a prompted question or 
selection, defined by the application, by the insertion of an 
instruction in the question or optional selection. The instruc- 
tion tells the user how to end the answer by, for example, 
saying a special predefined word or by allowing a predefined 
period of silence to occur that is recognizable by the terminal 
device. When the user says the predefined word or pauses for 
the predefined period of time, this is recognized by the ASR 
227 in the terminal, enabling the terminal to recognize that 
what came before was the requested response. 
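
The end-of-response rule can be sketched as follows. This Python example
is a simplification under stated assumptions: the function name
collect_response is hypothetical, recognized words arrive as (time, word)
pairs, and silence is measured between the start times of consecutive
words rather than from real audio.

```python
def collect_response(events, end_word="O.K.", max_silence=2.0):
    """Collect the words spoken before the end word or a long silence.

    `events` is an iterable of (time_in_seconds, word) in time order.
    """
    response = []
    last_time = None
    for time, word in events:
        if last_time is not None and time - last_time >= max_silence:
            break                 # predefined period of silence elapsed
        if word == end_word:
            break                 # predefined end word was spoken
        response.append(word)
        last_time = time
    return response

print(collect_response([(0.0, "NEW"), (0.4, "YORK"), (0.9, "O.K.")]))
```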

The invention enables the implementation of interactive
voice controlled services in different embodiments. 
Examples of these include, but are not limited to, the 
following: 

Voice browser on a WAP-enabled phone; 

Voice-enabled control unit that is digitally connected to

the control function in a processing unit; 
Voice-enabled special devices, such as electronic note-
Voice-enabled control of computer applications, such as
Application Program Interfaces (APIs) in windows- 
based operating systems and client/server environ- 
ments; and 
Voice-enabled control of standardized application proto-
cols based on a variety of markup or script languages 
with a small and defined vocabulary in the interactive 
application protocol. 

The invention has been described with reference to a 
particular embodiment. However, it will be readily apparent 
to those skilled in the art that it is possible to embody the 
invention in specific forms other than those of the preferred 
embodiment described above. This may be done without 
departing from the spirit of the invention. The preferred 
embodiment is merely illustrative and should not be con- 
sidered restrictive in any way. The scope of the invention is 
given by the appended claims, rather than the preceding 
description, and all variations and equivalents which fall 
within the range of the claims are intended to be embraced 
therein. 

What is claimed is: 

1. A method of controlling a service application provided 
to a terminal from a remote server, the method comprising 
the steps of: 

receiving an audio input signal representing audio infor- 
mation; 

using a first automatic speech recognition system located 
in the terminal to determine whether the audio input 
signal includes one or more words defined by a first 
vocabulary, wherein portions of the audio input signal 
not corresponding to the one or more words defined by 
the first vocabulary constitute an unrecognized portion 
of the audio input signal; 

if the audio input signal includes one or more words 
defined by the first vocabulary, then using a terminal 
application part of an application protocol service logic 
to determine what to do with the one or more words 
defined by the first vocabulary; 

formatting the unrecognized portion of the audio input 
signal for inclusion in a data unit whose structure is 
defined by a first predefined markup language; 

communicating the data unit to a remote application part 
via a first digital data link that operates in accordance 
with a first application protocol; and 

in the remote application part, extracting the formatted 
unrecognized portion of the audio input signal from the 
data unit and using a remote application part service 
logic to determine what to do with the formatted 
unrecognized portion of the audio input signal.

2. The method of claim 1, wherein the audio input signal 
is in the form of compressed digitally encoded speech. 

3. The method of claim 1, wherein if the audio input 
signal includes one or more words defined by the first 
vocabulary, then the terminal application part of the appli-
cation protocol service logic causes the one or more words
to be used to select one or more terminal functions to be 
performed. 

4. The method of claim 3, wherein the one or more 
terminal functions include selecting a current menu item as 
a response to be supplied to the remote server. 

5. The method of claim 3, wherein: 

a current menu item is associated with a first selection; 
and 

the one or more terminal functions include associating the 
current menu item with a second selection that is not 
the same as the first selection. 

6. The method of claim 1, wherein if the audio input 
signal includes one or more words defined by the first 
vocabulary, then the terminal application part of the appli- 
cation protocol service logic causes a corresponding mes-
sage to be generated and communicated to the remote
application part via the first digital data link. 

7. The method of claim 6, wherein the corresponding 
message includes state information. 

8. The method of claim 6, wherein the corresponding
message includes text. 

9. The method of claim 6, wherein the corresponding 
message includes binary data. 

10. The method of claim 6, wherein the remote applica-
tion part forwards the corresponding message to the remote
server.

11. The method of claim 10, wherein the remote appli-
cation part forwards the corresponding message to the
remote server via a second digital data link that operates in
accordance with a second application protocol.

12. The method of claim 11, wherein the first application 
protocol is the same as the second application protocol. 

13. The method of claim 1, further comprising the steps
of:

using a second automatic speech recognition system
located in the remote application part to determine
whether the unrecognized portion of the audio input
signal includes one or more words defined by a second
vocabulary; and

if the unrecognized portion of the audio input signal
includes one or more words defined by the second
vocabulary, then using the remote application part
service logic to determine what to do with the one or
more words defined by the second vocabulary.

14. The method of claim 13, wherein:
the first vocabulary exclusively includes words defined by 

a syntax of the first predefined markup language; and 
the second vocabulary exclusively includes words asso- 
ciated with the remote server. 

15. The method of claim 13, wherein if the unrecognized 
portion of the audio input signal includes one or more words 
defined by the second vocabulary, then the remote applica- 
tion part service logic causes a corresponding keyboard 
emulation response to be generated and sent to the remote 
server. 

16. The method of claim 13, wherein if the unrecognized 
portion of the audio input signal includes one or more words 
defined by the second vocabulary, then the remote applica- 
tion part service logic causes a remote application part 
service logic state to be changed.

17. The method of claim 1, further comprising the steps
of:

in the remote application part, receiving text from the 
remote server; 

in the remote application part, generating a corresponding 
audio output signal representing audio information; 

formatting the audio output signal for inclusion in a 
second data unit whose structure is defined by the first 
predefined markup language; 

communicating the second data unit to the terminal via 
the first digital data link; and 

in the terminal, extracting the audio output signal from the 
second data unit and generating therefrom a loud-
speaker signal.

18. An apparatus for controlling a service application 
provided to a terminal from a remote server, the apparatus 
comprising: 

means for receiving an audio input signal representing 
audio information;

a first automatic speech recognition system located in the 
terminal for determining whether the audio input signal 
includes one or more words defined by a first
vocabulary, wherein portions of the audio input signal 
not corresponding to the one or more words defined by 
the first vocabulary constitute an unrecognized portion
of the audio input signal; 
a terminal application part of an application protocol 
service logic for determining what to do with the one or 
more words defined by the first vocabulary if the audio 
input signal includes one or more words defined by the 
first vocabulary; 

means for formatting the unrecognized portion of the 
audio input signal for inclusion in a data unit whose 
structure is defined by a first predefined markup lan- 
guage; 

means for communicating the data unit to a remote 
application part via a first digital data link that operates 
in accordance with a first application protocol; and 

the remote application part, comprising: 

means for extracting the formatted unrecognized portion 
of the audio input signal from the data unit; and 

a remote application part service logic for determining 
what to do with the formatted unrecognized portion of 
the audio input signal. 

19. The apparatus of claim 18, wherein the audio input 
signal is in the form of compressed digitally encoded speech. 

20. The apparatus of claim 18, wherein the terminal 
application part of the application protocol service logic 
comprises: 

means for causing the one or more words to be used to 
select one or more terminal functions to be performed 
if the audio input signal includes one or more words 
defined by the first vocabulary. 

21. The apparatus of claim 20, wherein the one or more 
terminal functions include selecting a current menu item as 
a response to be supplied to the remote server. 

22. The apparatus of claim 20, wherein: 

a current menu item is associated with a first selection; 
and 

the one or more terminal functions include associating the 
current menu item with a second selection that is not 
the same as the first selection. 

23. The apparatus of claim 18, wherein the terminal 
application part of the application protocol service logic 
comprises: 

means for causing a corresponding message to be gener- 
ated and communicated to the remote application part 
via the first digital data link if the audio input signal 
includes one or more words defined by the first vocabu- 
lary. 

24. The apparatus of claim 23, wherein the corresponding 
message includes state information. 

25. The apparatus of claim 23, wherein the corresponding
message includes text. 

26. The apparatus of claim 23, wherein the corresponding 
message includes binary data.

27. The apparatus of claim 23, wherein the remote appli- 
cation part includes means for forwarding the corresponding 
message to the remote server. 

28. The apparatus of claim 27, wherein the remote appli- 
cation part includes means for forwarding the corresponding 
message to the remote server via a second digital data link 
that operates in accordance with a second application pro- 
tocol. 

29. The apparatus of claim 28, wherein the first applica- 
tion protocol is the same as the second application protocol. 
30. The apparatus of claim 18, further comprising:

a second automatic speech recognition system located in
the remote application part for determining whether the 
unrecognized portion of the audio input signal includes 
one or more words defined by a second vocabulary, and

wherein the remote application part service logic includes 
means for determining what to do with the one or more 
words defined by the second vocabulary if the unrec- 
ognized portion of the audio input signal includes one 
or more words defined by the second vocabulary. 

31. The apparatus of claim 30, wherein: 

the first vocabulary exclusively includes words defined by 
a syntax of the first predefined markup language; and 

the second vocabulary exclusively includes words asso- 
ciated with the remote server. 

32. The apparatus of claim 30, wherein the remote appli-
cation part service logic comprises:

means for causing a corresponding keyboard emulation 
response to be generated and sent to the remote server 
if the unrecognized portion of the audio input signal 
includes one or more words defined by the second 
vocabulary. 
33. The apparatus of claim 30, wherein the remote appli-
cation part service logic comprises:

means for causing a remote application part service logic
state to be changed if the unrecognized portion of the
audio input signal includes one or more words defined
by the second vocabulary.

34. The apparatus of claim 18, further comprising: 
means, in the remote application part, for receiving text 

from the remote server; 
means, in the remote application part, for generating a 
corresponding audio output signal representing audio 
information; 

means for formatting the audio output signal for inclusion 
in a second data unit whose structure is defined by the 
first predefined markup language; 

means for communicating the second data unit to the
terminal via the first digital data link; and

means, in the terminal, for extracting the audio output 
signal from the second data unit and generating there- 
from a loudspeaker signal. 